This course covers the basics of reproducible research with R, RStudio, and GitHub. My GitHub repository can be found here: https://github.com/viljahe/IODS-project
date()
## [1] "Thu Nov 19 15:37:30 2020"
I found this course by browsing courses for doctoral students, trying to find something that would actually benefit me in some way: I’m interested in open science, and while I have been using R for some years, Git is something I’m not yet familiar with but have been wanting to learn for a while now. So this was a good opportunity to do just that.
I’m looking forward to learning how Git can make my research more open and more reproducible. I hope I can adopt the things I learn from this course as part of my workflow. Three things I’m most looking forward to, and goals for myself, on this course:
This week we learned about data wrangling and linear regression. First, we created the data set used in the analyses below, including some summary variables. Then we learned how to conduct a multiple linear regression and how to assess the model diagnostics.
# load the necessary packages and set scientific notation off
options(scipen = 999)
library("GGally")
library("ggplot2")
First, let’s load the data set created for this exercise to R and inspect its dimension and structure using dim() and str():
learning2014 <- read.table("learning2014.txt")
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
The data contains information on 166 students from an introductory statistics course: participants’ gender and age, a measure of global attitudes towards statistics, the points scored in the exam, and three variables reflecting deep, surface, and strategic learning. Each learning variable is the mean of several original items asking about the participant’s learning approaches in that category, and the attitude variable is the mean of ten original items asking about different attitudes towards statistics.
We can inspect the data and the associations between variables visually using the function ggpairs() and look at the most common descriptive statistics with the help of summary() to get a more complete grasp of the data.
p <- ggpairs(learning2014,
mapping = aes(col = gender, alpha = 0.1),
lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor", size = 2.5)),
) +
theme_bw() +
theme(strip.background = element_rect(fill = "white"),
axis.text.x = element_text(size=5),
axis.text.y = element_text(size=5))
p
summary(learning2014)
## gender age attitude points
## Length:166 Min. :17.00 Min. :1.400 Min. : 7.00
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:19.00
## Mode :character Median :22.00 Median :3.200 Median :23.00
## Mean :25.51 Mean :3.143 Mean :22.72
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:27.75
## Max. :55.00 Max. :5.000 Max. :33.00
## deep surf stra
## Min. :1.583 Min. :1.583 Min. :1.250
## 1st Qu.:3.333 1st Qu.:2.417 1st Qu.:2.625
## Median :3.667 Median :2.833 Median :3.188
## Mean :3.680 Mean :2.787 Mean :3.121
## 3rd Qu.:4.083 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.917 Max. :4.333 Max. :5.000
The data seems to contain more women than men, and the participants are mostly under the age of 30. There are some differences between men and women in how the variables are distributed, most visibly in the global attitudes towards statistics, where men score somewhat higher on average and the women’s scores are spread more evenly. Comparing the learning approaches, deep learning scores skew towards the higher end, whereas strategic learning scores are distributed more evenly around the middle. For men, the surface learning scores are spread more evenly than for women, who show a clearer peak around the middle values of the variable.
There is a significant positive correlation between the global attitudes towards statistics and the exam score (\(r = .437\)), and this is also true for both gender groups. Attitudes towards statistics also seem to be slightly correlated with surface learning, with a more positive attitude being associated with a lower score on surface learning (\(r = -.176\)), but upon closer inspection there is no significant correlation for women, and for men the association is stronger than for all participants together (\(r = -.374\)). There is also an expected correlation between surface learning and deep learning, with higher scores on surface learning being associated with lower scores on deep learning (\(r = -.324\)), but surprisingly this effect is once again mostly driven by a strong association for men (\(r = -.622\)), whereas for women the two learning approaches are not associated with each other.
The associations between variables can be further examined using linear regression. For example, we can see how exam score is explained by the variables that have the strongest correlation with it: attitudes towards statistics, the surface learning approach, and the strategic learning approach.
model1 <- lm(points ~ attitude + surf + stra, data = learning2014)
summary(model1)
##
## Call:
## lm(formula = points ~ attitude + surf + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 0.0000000193 ***
## surf -0.5861 0.8014 -0.731 0.46563
## stra 0.8531 0.5416 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 0.00000003156
Based on the results of the regression, it would seem that a more positive attitude towards statistics is associated with a higher score on the exam (\(B = 3.40, p <.001\)). This means that for each one-point increase on the attitude scale, the predicted exam score increases by approximately 3.4 points. The two learning approaches included in the model do not seem to be associated with exam scores, so they can be removed from the model.
# remove surf first as its p-value is the higher of the two
model2 <- update(model1, ~. -surf)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 0.00000000631 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 0.000000007734
# stra is still not significant, remove it as well
model3 <- update(model2, ~. -stra)
summary(model3)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 0.00000000195 ***
## attitude 3.5255 0.5674 6.214 0.00000000412 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 0.000000004119
With the surface learning approach removed from the model, the p-value associated with the strategic learning approach decreases, but the effect remains non-significant at the conventional significance level of \(p < .05\). Removing both non-significant predictors does not greatly affect the association between attitudes towards statistics and exam scores. More positive attitudes towards statistics are associated with higher exam scores, and the magnitude of the effect remains roughly the same (\(B = 3.53, p <.001\)). The multiple \(R^2\) is a measure of the goodness of fit of the model, i.e. it tells how much of the variance of the dependent variable the model explains. From the \(R^2\) we can infer that the model explains approximately 19% of the variance in exam scores.
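To make the coefficient interpretation concrete, the fitted values of the final model can be computed by hand from the printed coefficients (a rough sketch; `predict(model3)` gives the same numbers up to rounding):

```r
# fitted exam score from the final model: points = 11.6372 + 3.5255 * attitude
# (coefficients copied from the summary output above)
predict_points <- function(attitude) {
  11.6372 + 3.5255 * attitude
}

# a one-point increase in attitude raises the predicted exam score by ~3.5 points
predict_points(4) - predict_points(3)
```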
Let’s then examine how well our model meets the assumptions of linear regression. We can do this by inspecting diagnostic plots:
par(mfrow=c(2,2))
plot(model3, c(1, 2, 5))
In the first figure the residuals are plotted against the fitted values of the model. The assumption of constant error variance is met, as the residuals are spread quite randomly. From the Q-Q plot we can see that the errors are reasonably normally distributed, following the dashed line quite well. The third figure shows that no observations have a very high leverage. Based on these plots, there are no serious problems with the model.
This week was all about data wrangling, specifically joining two datasets together, and logistic regression. Again, in the data wrangling part we created the data set used in the analyses below by merging two datasets and creating some new variables. Then we learned how to conduct a logistic regression and assess the predictive power of the model.
# load the necessary packages and set scientific notation off
options(scipen = 999)
library("dplyr")
library("tidyr")
library("ggplot2")
library("gridExtra")
library("Hmisc")
This data consists of information about students’ alcohol consumption, school performance in two subjects (mathematics and Portuguese language), and some background variables relating to their social environment, family, and school. The data has been collected from two Portuguese schools. More information about the data can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
alc <- read.csv("alc.csv", sep=",", header=TRUE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "failures.m"
## [16] "schoolsup" "famsup" "paid.m" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences.m"
## [31] "G1.m" "G2.m" "G3.m" "failures.p" "paid.p"
## [36] "absences.p" "G1.p" "G2.p" "G3.p" "id.p"
## [41] "failures" "paid" "absences" "G1" "G2"
## [46] "G3" "alc_use" "high_use"
The data has been merged from two separate datasets, one for math and the other for Portuguese language. The background variables are identical between the two original datasets, but the variables relating to school performance (failures, paid, absences, G1, G2, and G3) differ between the two subjects. These variables are the number of past class failures (failures), whether or not a student has taken extra paid classes (paid), the number of absences (absences), and the first, second, and final grades (G1-G3, respectively). The final dataset used here therefore has three separate variables for each subject-specific original variable: the two original variables denoted by the suffix .p (for Portuguese) or .m (for math), and their mean (without a suffix).
I chose to examine how weekly study time (studytime), parents’ cohabitation status (Pstatus), the number of school absences (absences), and the mean of the final grades in math and Portuguese (G3) affect whether a student’s alcohol consumption is high or low (high_use). Study time is a discrete variable with four categories: less than 2 hours (1), 2-5 hours (2), 5-10 hours (3), and more than 10 hours (4). Parents’ cohabitation status has two possible values: living together (T) and living apart (A). The number of school absences is a continuous variable ranging from 0 to 93, and alcohol consumption is a categorical variable based on workday and weekend alcohol consumption (each rated from 1, “very low”, to 5, “very high”): students whose mean of the two variables is higher than 2 have been categorized as high alcohol users.
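As a reminder of the wrangling step, the alcohol use variables can be derived from the workday (Dalc) and weekend (Walc) consumption columns roughly as follows (a sketch with toy values, not the exact script used in the exercise):

```r
# toy rows standing in for the Dalc and Walc columns of the real data
d <- data.frame(Dalc = c(1, 3, 2), Walc = c(1, 5, 2))

# alc_use is the mean of workday and weekend consumption (both on a 1-5 scale)
d$alc_use <- (d$Dalc + d$Walc) / 2

# consumption is categorized as high when the mean exceeds 2
d$high_use <- d$alc_use > 2
```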
My hypotheses are that more time spent on studying and higher grades are connected to low alcohol consumption, while higher number of absences and parents living apart are connected to high alcohol consumption.
Let’s have a closer look at the chosen variables and how they relate to each other.
# subset for only relevant variables
alc.sub <- subset(alc, select = c(studytime, absences, G3, Pstatus, high_use))
# see numerical summaries of variables
summary(alc.sub[(-4)])
## studytime absences G3 high_use
## Min. :1.000 Min. : 0.000 Min. : 0.00 Mode :logical
## 1st Qu.:1.000 1st Qu.: 1.000 1st Qu.:10.00 FALSE:259
## Median :2.000 Median : 3.000 Median :12.00 TRUE :111
## Mean :2.043 Mean : 4.511 Mean :11.52
## 3rd Qu.:2.000 3rd Qu.: 6.000 3rd Qu.:14.00
## Max. :4.000 Max. :45.000 Max. :18.00
table(alc.sub["Pstatus"])
##
## A T
## 38 332
# barplots for the distributions
numplot <- alc.sub[c("G3", "studytime", "absences")] %>%
  gather() %>%
  ggplot(aes(as.numeric(value))) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  xlab("value") +
  theme_bw() +
  theme(strip.background = element_rect(fill = "white"))
discplot <- alc.sub[c("Pstatus", "high_use")] %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme_bw() +
  theme(strip.background = element_rect(fill = "white"))
grid.arrange(numplot, discplot, nrow=2)
We can see that most of the students have quite a low number of absences, with a few outliers above 40. The mean of the final grades is 11.5, and the distribution is somewhat skewed towards higher grades. Study time is mostly below 5 hours per week. Students categorized as low alcohol users outnumber high users by more than two to one (259 vs. 111), and most students’ parents live together.
# box plots for alcohol use vs. continuous variables
box1 <- ggplot(alc.sub, aes(x = high_use, y = G3)) +
geom_boxplot() +
ylab("grade") +
ggtitle("Alcohol use and final grades") +
theme_bw()
box2 <- ggplot(alc.sub, aes(x = high_use, y = absences)) +
geom_boxplot() +
ylab("absences") +
ggtitle("Alcohol use and absences") +
theme_bw()
# bar plots for alcohol use vs. discrete variables
bar1 <- alc.sub %>% ggplot(aes(x = studytime, fill = high_use)) +
geom_bar(position = "dodge") +
xlab("Study time") +
ggtitle("Alcohol use and study time") +
theme_bw()
bar2 <- alc.sub %>% ggplot(aes(x = Pstatus, fill = high_use)) + geom_bar(position = "dodge") +
xlab("Parental status") +
ggtitle("Alcohol use and parental status") +
theme_bw()
# draw the plots
grid.arrange(box1, box2, bar1, bar2, ncol=2)
At first glance it doesn’t seem like alcohol use is connected to the mean of the final grades or the number of absences. Weekly study time seems to be a bit lower for those whose alcohol consumption is high, and those whose parents live apart might be slightly overrepresented among high alcohol users as well, although as the number of parents living apart in this data is very small, it’s hard to draw any definite conclusions from these figures alone.
# correlations for numerical variables
corrs <- rcorr(as.matrix(alc.sub[(-4)]))
corrs
## studytime absences G3 high_use
## studytime 1.00 -0.12 0.17 -0.21
## absences -0.12 1.00 -0.10 0.22
## G3 0.17 -0.10 1.00 -0.13
## high_use -0.21 0.22 -0.13 1.00
##
## n= 370
##
##
## P
## studytime absences G3 high_use
## studytime 0.0254 0.0007 0.0000
## absences 0.0254 0.0484 0.0000
## G3 0.0007 0.0484 0.0112
## high_use 0.0000 0.0000 0.0112
# crosstabs for parental status and high use
table(alc.sub[c("Pstatus", "high_use")])
## high_use
## Pstatus FALSE TRUE
## A 26 12
## T 233 99
The rcorr() function prints the correlations between the variables and the associated p-values. We can see that both study time and absences are statistically significantly correlated with alcohol use, with more time spent on studies being associated with low alcohol consumption (\(r = -.21, p<.001\)) and a higher number of absences with high alcohol consumption (\(r = .22, p<.001\)). This would seem to support my hypotheses. Higher grades are also associated with more time spent on studying (\(r = .17, p<.001\)) but not statistically significantly with alcohol use, although the direction of the association between the mean of final grades and alcohol use is in line with my hypothesis.
The proportion of high alcohol users in both parental residential status groups is approximately the same, which does not support my hypothesis.
We can then use logistic regression to see if the results support my hypotheses.
# first, specify studytime as factor as it is actually a categorical variable
alc.sub$studytime <- as.factor(alc.sub$studytime)
# specify model
model1 <- glm(high_use ~ studytime + absences + G3 + Pstatus, data = alc.sub, family = "binomial")
# print summary
modelsum1 <- summary(model1)
modelsum1
##
## Call:
## glm(formula = high_use ~ studytime + absences + G3 + Pstatus,
## family = "binomial", data = alc.sub)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1429 -0.8596 -0.6513 1.1840 2.1990
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.22075 0.62094 -0.356 0.722211
## studytime2 -0.46568 0.26686 -1.745 0.080987 .
## studytime3 -1.36077 0.44121 -3.084 0.002041 **
## studytime4 -1.28496 0.58511 -2.196 0.028086 *
## absences 0.07614 0.02311 3.294 0.000986 ***
## G3 -0.05500 0.03682 -1.494 0.135216
## PstatusT 0.13698 0.40095 0.342 0.732632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 416.24 on 363 degrees of freedom
## AIC: 430.24
##
## Number of Fisher Scoring iterations: 4
# retrieve odds ratios, confidence intervals and p-values
OR <- coef(model1) %>% exp()
CI <- confint(model1) %>% exp()
p <- coef(modelsum1)[,4]
# print table of odds ratios, confidence intervals and p-values
cbind(OR, CI, p) %>% round(3)
## OR 2.5 % 97.5 % p
## (Intercept) 0.802 0.233 2.684 0.722
## studytime2 0.628 0.372 1.061 0.081
## studytime3 0.256 0.102 0.585 0.002
## studytime4 0.277 0.076 0.797 0.028
## absences 1.079 1.034 1.132 0.001
## G3 0.946 0.880 1.017 0.135
## PstatusT 1.147 0.535 2.608 0.733
In the specified model, weekly study time and absences are both statistically significantly associated with alcohol use. Those who study 5-10 hours (\(OR = 0.26, CI = 0.10, 0.59, p =.002\)) or over 10 hours per week (\(OR = 0.28, CI = 0.08, 0.80, p =.028\)) are less likely to be high alcohol users than those in the reference category, who study less than 2 hours per week, while those with more absences are more likely to be high alcohol users (\(OR = 1.08, CI = 1.04, 1.13, p <.001\)). The mean of final grades and parental status are not associated with alcohol use.
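The odds ratios in the table are just the exponentiated log-odds coefficients from the model summary; for example, for absences:

```r
# a logistic regression coefficient is a log-odds; exponentiating it gives the odds ratio
b_absences <- 0.07614    # coefficient copied from the model summary above
or_absences <- exp(b_absences)
or_absences              # ~1.079: each additional absence multiplies the odds of high use by ~1.08
```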
The hypotheses regarding study time and absences were supported, but the results don’t support the hypotheses about final grades and parental residential status being associated with alcohol use.
Only two of the hypothesized variables were connected to alcohol use. Let’s see how well a model which includes these two variables predicts alcohol use.
model2 <- update(model1,.~. -G3 -Pstatus)
summary(model2)
##
## Call:
## glm(formula = high_use ~ studytime + absences, family = "binomial",
## data = alc.sub)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1907 -0.8577 -0.7137 1.1984 2.1174
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.69160 0.23897 -2.894 0.003802 **
## studytime2 -0.50842 0.26455 -1.922 0.054631 .
## studytime3 -1.43763 0.43660 -3.293 0.000992 ***
## studytime4 -1.36272 0.58234 -2.340 0.019279 *
## absences 0.07788 0.02305 3.379 0.000727 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 418.70 on 365 degrees of freedom
## AIC: 428.7
##
## Number of Fisher Scoring iterations: 4
# predict() the probability of high_use with the reduced model
probabilities <- predict(model2, type = "response")
# add the predicted probabilities to 'alc'
alc.sub <- mutate(alc.sub, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc.sub <- mutate(alc.sub, prediction = ifelse(probability > 0.5, T , F))
# 2x2 cross table of alcohol use vs. the predictions
table(high_use = alc.sub$high_use, prediction = alc.sub$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 247 12
## TRUE 93 18
From the cross-tabulation we can see that the majority of students fall into the category where both the predicted and the actual alcohol use were low. Of the actual high alcohol users, most were predicted to be low users. Altogether, 93 students were wrongly classified as low alcohol users, and 12 were wrongly classified as high alcohol users. This is not very surprising, given that the class sizes in the data are very unequal, with most students being low alcohol users.
We can also examine the predictions and actual alcohol use visually by drawing a scatterplot, and calculate the percentage of wrongly classified individuals:
# explore the predictions graphically
g.pred <- ggplot(alc.sub, aes(x = probability, y = high_use, col = prediction))
g.pred + geom_point(alpha = 0.5) + geom_jitter(height=0.1) + theme_bw()
# calculate and print the total proportion of wrong classifications
prop_wrong <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
perc <- sum(n_wrong)/length(class) * 100
print(paste(round(perc,2), "% classified wrong"))
}
prop_wrong(class = alc.sub$high_use, prob = alc.sub$probability)
## [1] "28.38 % classified wrong"
The percentage of wrongly classified individuals based on our model is 28.38%. Although this number is quite high, it is notably better than random guessing, which would result in approximately 50% wrong classifications.
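For context, the class counts shown earlier (259 low vs. 111 high users) give a stricter baseline than coin-flip guessing: always predicting the majority class "low" would be wrong only for the high users. A quick sketch:

```r
# naive majority-class baseline, using the class counts from the data above
n_low  <- 259
n_high <- 111
baseline_wrong <- n_high / (n_low + n_high) * 100
baseline_wrong   # 30: always predicting "low" misclassifies 30% of students
```

So the model’s 28.38% error rate improves clearly on random guessing but only modestly on this majority-class baseline.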
This week we learned about clustering and classification methods.
# load the necessary packages and set scientific notation off
options(scipen = 999)
library("MASS")
library("ggplot2")
library("dplyr")
library("GGally")
In this week’s analyses we use the Boston dataset from the MASS package. The data describes housing values in the suburbs of Boston in the 1970s. More information can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
# load the data
data("Boston")
# explore the structure and dimensions
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston data consists of 506 observations and 14 variables, which include the town’s per capita crime rate, a number of variables reflecting the average socioeconomic status of the area and the demographics of its residents, general information about the location (e.g. proximity to highways and the Charles River), and air quality.
Let’s have a closer look at the data and the variables.
plot1 <- ggpairs(Boston,
mapping = aes(alpha = 0.1),
lower = list(continuous = wrap("points", size = 0.3)),
upper = list(continuous = wrap("cor", size = 2.5)),
) +
theme_bw() +
theme(strip.background = element_rect(fill = "white"),
axis.text.x = element_text(size=4),
axis.text.y = element_text(size=5))
plot1
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Some of the variables are heavily skewed towards lower values, most notably crim (the per capita crime rate by town), zn (proportion of residential land zoned for large lots), and chas (a dummy variable indicating whether the area borders the Charles River). The data therefore consists of areas that mostly have low crime rates and smaller residential lots, and are not next to the river. The variables dis (distance to employment centres) and lstat (percentage of lower-status population) are also skewed towards lower values, so most areas are close to the employment centres and have a low percentage of lower-status residents. On the other hand, the variables age, ptratio, and black are skewed towards higher values, meaning the data includes more areas with older houses, high pupil-teacher ratios, and fewer black residents. A few variables have two clear peaks: indus (proportion of non-retail business acres), rad (accessibility of radial highways), and tax (property tax rate). The variable nox, describing the nitrogen oxides concentration, also has mostly quite low values, while medv, the median value of owner-occupied homes, peaks around its middle values.
There are a number of substantial correlations between the variables, and some interesting patterns are immediately visible in the scatterplots: for example, the nitrogen oxides concentration is higher where the houses are older and lower where the distance to the employment centres is greater, and a higher percentage of lower-status population is associated with a lower median value of owner-occupied homes. Nothing too surprising.
Next we can standardize the data using the function scale().
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
This takes each value, subtracts the variable mean from it, and divides the difference by the standard deviation of the variable. A scaled value thus tells how many standard deviations an observation lies from the variable mean. The mean of each scaled variable is now 0, values that were originally below the mean are negative, and values above the mean are positive. For example, the maximum crime rate in the original data is 88.9762, which is 85.3626764 above the mean; the standard deviation of the crime rate is 8.6015451, so the maximum is 85.3626764 / 8.6015451 = 9.9241096 standard deviations above the mean. This is the new maximum value in the scaled data.
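The arithmetic can be checked directly with the figures quoted above (scale() performs the same computation for every column at once):

```r
# z-score of the maximum crime rate, using the values quoted above
x_max  <- 88.97620    # maximum of crim in the original data
x_mean <- 3.6135236   # mean of crim
x_sd   <- 8.6015451   # standard deviation of crim
(x_max - x_mean) / x_sd   # ~9.9241, the maximum of the scaled variable
```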
Let’s then replace the crime rate variable with a categorical variable of the crime rate using quantiles as the break points.
# change the class to dataframe
boston_scaled <- as.data.frame(boston_scaled)
# define the breaks
bins <- quantile(boston_scaled$crim)
# create the categorical variable
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove the original crime rate variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
First, we will divide the dataset into train and test sets: we will use the train set to train our classification model and then see how well the model can predict the classes in the test set.
# choose randomly 80% of the rows in the data for the train set
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
Next, we will fit a linear discriminant analysis on the train set, using the categorical crime rate variable crime as the target variable and all the other variables as predictors.
# Fit the model
lda.crime <- lda(crime ~., data = train)
lda.crime
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2450495 0.2574257 0.2351485 0.2623762
##
## Group means:
## zn indus chas nox rm age
## low 1.06218361 -0.9032299 -0.11325431 -0.8925217 0.5199204 -0.8981884
## med_low -0.06774513 -0.2995262 -0.08304540 -0.5842854 -0.1550518 -0.3431583
## med_high -0.35996279 0.1139669 0.34931834 0.3964225 0.2112513 0.4512791
## high -0.48724019 1.0170298 -0.08661679 1.0438239 -0.4354104 0.8123793
## dis rad tax ptratio black lstat
## low 0.8928300 -0.7034549 -0.7188704 -0.44836468 0.37743843 -0.78471097
## med_low 0.4289096 -0.5246930 -0.4298679 -0.05629691 0.31161083 -0.11230901
## med_high -0.3906318 -0.4499496 -0.3719643 -0.40198260 0.07866321 0.03264883
## high -0.8633493 1.6390172 1.5146914 0.78181164 -0.83601360 0.93663113
## medv
## low 0.56336593
## med_low -0.03210862
## med_high 0.25704132
## high -0.70897775
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0759924097 0.62808312 -1.12376486
## indus 0.1046319354 -0.04291061 0.26065904
## chas -0.1332305354 -0.12150575 -0.03164972
## nox 0.2722408982 -0.91412381 -1.12365791
## rm -0.1273565995 -0.11781650 -0.25667311
## age 0.2278116245 -0.36123138 -0.05211785
## dis -0.0551028171 -0.24631084 0.46080506
## rad 3.7576159308 0.80927186 -0.19159573
## tax 0.0009654825 0.12207901 0.67335656
## ptratio 0.0722684682 0.02953102 -0.36496311
## black -0.0911938987 0.03887020 0.10590572
## lstat 0.2361006972 -0.24791530 0.25327624
## medv 0.2193253548 -0.36265217 -0.10676869
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9556 0.0319 0.0125
# Draw the LDA biplot
# target classes as numeric
classes <- as.numeric(train$crime)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "black", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# define the colors for the plot from green (low) to red (high)
cols <- c("green", "blue", "purple", "red")
cols1 <- cols[classes]
# plot the lda results
plot(lda.crime, dimen = 2, col = cols1, pch = classes)
lda.arrows(lda.crime, myscale = 2)
We can see from the plot that the high crime rate observations form a clear cluster, well separated from the other categories. The other categories also form clusters, although with much more overlap between them. The arrows show that the variable rad (accessibility to radial highways) contributes most to the separation of the groups: observations in the high crime rate group have clearly higher values on this variable than the other groups. The variables zn and nox seem to contribute more to the separation between the low, medium low, and medium high groups, with higher nitrogen oxides concentrations being related to higher crime rates, and a higher proportion of large residential plots being related to lower crime rates. The same can be seen from the table of group means.
# save crime rate categories from test data
correct_classes <- test$crime
# remove original crime rate categories
test <- dplyr::select(test, -crime)
# predict classes with the model on the test data
lda.pred <- predict(lda.crime, newdata = test)
# crosstabulate results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 14 0 0
## med_low 5 11 6 0
## med_high 0 10 18 3
## high 0 0 0 21
It looks like the predictions are fairly good: in every category, the correct predictions are in the majority. The "high" group was predicted most accurately, which can also be seen from the previous plot, as the high values most clearly formed a separate cluster.
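The overall accuracy can also be computed directly: from the cross-tabulation above, the diagonal sums to 14 + 11 + 18 + 21 = 64 correct predictions out of 102 test observations, i.e. about 63%. A one-line check of my own:

```r
# overall proportion of correct predictions on the test set
mean(correct_classes == lda.pred$class)
```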
Let’s reload the Boston data, standardize it again, inspect the distances, and run the k-means algorithm on this data.
# reload the Boston data
data("Boston")
# standardize the dataset
boston_scaled <- scale(Boston)
# calculate distances between observations
dist_eu <- dist(boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
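By default, dist() computes Euclidean distances. Other metrics could be inspected in the same way; for example, the Manhattan distance (purely an illustration of mine, not used in the analysis below):

```r
# Manhattan (city block) distance between the scaled observations
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)
```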
# run the k-means algorithm
km <- kmeans(boston_scaled, centers = 4)
km
## K-means clustering with 4 clusters of sizes 193, 65, 132, 116
##
## Cluster means:
## crim zn indus chas nox rm
## 1 -0.3925546 -0.1357810 -0.6436913 -0.007135788 -0.5651594 0.3794958
## 2 -0.4135478 2.2539281 -1.1502450 -0.211758295 -1.1815084 0.6543177
## 3 1.0632703 -0.4872402 1.0149946 -0.033716932 1.0159127 -0.3735788
## 4 -0.3250693 -0.4826198 0.5605091 0.168897683 0.4463218 -0.5729391
## age dis rad tax ptratio black
## 1 -0.4179401 0.3379844 -0.5581880 -0.7291262 -0.33462529 0.36406556
## 2 -1.4746155 1.6280013 -0.6585336 -0.5368747 -0.79454608 0.34634486
## 3 0.7542188 -0.8233749 1.6596029 1.5294129 0.80577843 -0.75124560
## 4 0.6634100 -0.5376344 -0.5907984 -0.2264164 0.08504675 0.05506337
## lstat medv
## 1 -0.5816331 0.4880272
## 2 -0.9492429 0.7282687
## 3 0.8328654 -0.6664074
## 4 0.5518771 -0.4617322
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 4 1 1 1 1 1 4 1 1 4 1 4
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 4 4 4 4 4 4 4 4 4 1 4 4 4 4 4 1 1 1 1 2
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 2 1 1 1 1 1 1 1 4 1 1 1 1 1 2 2 2 2 1 1
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 1 1 4 4 4 4 4 4 4 4 1 1 4 4 4 4 4 4 4 4
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 4 4
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 4 1 1 1 4 4 1 4 4 4 4 4 4 1 1 1 1 1 1 1
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
## 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
## 2 2 2 2 2 1 1 4 1 4 4 4 4 1 4 1 4 1 4 4
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 1 1 1 2 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 1
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## 1 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 2 2
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 2 1 2 2 1 1 1 1 1 4 4 1 4 1 1 4 4 4 1 1
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## 1 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500
## 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4
## 501 502 503 504 505 506
## 4 4 4 1 4 4
##
## Within cluster sum of squares by cluster:
## [1] 1147.7552 305.1560 1198.3572 836.7043
## (between_SS / total_SS = 50.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The k-means algorithm needs the number of clusters to be specified. I chose 4 clusters arbitrarily, but there is a better way to determine the appropriate number of clusters: inspecting the total within-cluster sum of squares (TWCSS):
set.seed(42)
# determine the maximum number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The plot shows the TWCSS on the y-axis and the number of clusters on the x-axis. Here I chose to plot the TWCSS for 1–10 clusters (as it is very unlikely that a larger number of clusters would improve the solution). The rule of thumb for choosing the appropriate number of clusters is to find an "elbow" in the line, i.e. the point after which adding clusters no longer produces a big drop in the TWCSS. In this case, 2 seems to be a good number of clusters.
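The elbow can also be inspected numerically by looking at the relative drop in TWCSS when moving from k to k + 1 clusters (a sketch of my own, using the twcss vector computed above): a large first value followed by much smaller ones supports choosing 2 clusters.

```r
# relative drop in TWCSS when adding one more cluster
rel_drop <- -diff(twcss) / head(twcss, -1)
round(rel_drop, 2)
```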
Let’s run the k-means algorithm with two clusters.
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
km
## K-means clustering with 2 clusters of sizes 329, 177
##
## Cluster means:
## crim zn indus chas nox rm
## 1 -0.3894158 0.2621323 -0.6146857 0.002908943 -0.5823397 0.2446705
## 2 0.7238295 -0.4872402 1.1425514 -0.005407018 1.0824279 -0.4547830
## age dis rad tax ptratio black lstat
## 1 -0.4331555 0.4540421 -0.5828749 -0.6291043 -0.2943707 0.3282754 -0.4530491
## 2 0.8051309 -0.8439539 1.0834228 1.1693521 0.5471636 -0.6101842 0.8421083
## medv
## 1 0.3532917
## 2 -0.6566834
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 1 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500
## 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1
## 501 502 503 504 505 506
## 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 2686.045 1890.637
## (between_SS / total_SS = 35.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# plot the Boston dataset with clusters
boston_scaled <- as.data.frame(boston_scaled)
plot2 <- ggpairs(boston_scaled,
                 mapping = aes(col = as.factor(km$cluster), alpha = 0.1),
                 lower = list(continuous = wrap("points", size = 0.1)),
                 upper = "blank",
                 switch = "both") +
  theme_bw() +
  theme(strip.background = element_rect(fill = "white"),
        axis.text.x = element_text(size = 4),
        axis.text.y = element_text(size = 5))
plot2
We can see (from the plot and the cluster means above) that cluster 2 is characterized by somewhat higher crime rates, a distinctively lower proportion of large residential plots, a higher nitrogen oxides concentration, and more older buildings. It also contains areas with shorter distances to employment centres, better accessibility to radial highways, a higher property tax rate, a higher pupil-teacher ratio, and a higher percentage of lower status population. Cluster 1 seems to have a low crime rate, a high proportion of large residential plots, less non-retail business land, and a lower nitrogen oxides concentration, as well as lower accessibility to highways, lower values on the black variable, and a lower percentage of lower status population.

In short, our two clusters seem to describe two types of areas: one with generally more desirable attributes for a residential area and quite likely residents of higher socioeconomic status, and the other with less desirable attributes and residents of lower socioeconomic status. If I had to guess, I'd say these areas probably differ in their proximity to the city proper, with cluster 2 describing more urban areas and cluster 1 more suburban areas.

(more chapters to be added similarly as we proceed with the course!)
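Finally, as a quick check on the cluster interpretation above, the two clusters can be compared on the original (unscaled) per-capita crime rate; a small sketch of my own, using the Boston data reloaded earlier:

```r
# summary of the raw crime rate within each of the two clusters
tapply(Boston$crim, km$cluster, summary)
```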